prometheus metrics for http webhook publishers by adonthi-fws · Pull Request #1149 · conductor-oss/conductor

adonthi-fws · 2026-06-03T06:51:21Z

Summary

Adds Prometheus/Micrometer metrics for HTTP webhook publishers (workflow_publisher and task_publisher), following the same Monitors pattern used by archive listeners (e.g. recordWorkflowArchived).

New metrics:

- webhook_publish_success: HTTP 200/202 after publish
- webhook_publish_failure: exception during publish
- webhook_enqueue_failure: offer() returns false when the buffer is full
- webhook_queue_depth: in-memory notification queue size after enqueue

Test plan

- Enable conductor.workflow-status-listener.type=workflow_publisher and conductor.task-status-listener.type=task_publisher with a reachable webhook URL.
- Run a workflow to completion.
- Confirm metrics at /actuator/prometheus:
  - webhook_publish_success_total
  - webhook_queue_depth
- (Optional) Point the webhook URL at an unreachable host and confirm webhook_publish_failure_total.

adonthi-fws · 2026-06-04T07:32:37Z

Hello @v1r3n, please review this.

nthmost-orkes

Thanks for the submission! There are a few things to clean up first...

First, a small consistency improvement to make here:

recordWebhookQueueDepth takes (notificationType, size), but the other three methods all carry a name parameter (workflow/task definition name). That means you can slice success and failure by name but not queue depth. Either add name to recordWebhookQueueDepth to match, or drop it from the others, whichever fits how you'd actually query these in Grafana.

nthmost-orkes

Quick question: can you double-check that webhook_enqueue_failure will actually fire in practice?

Both publishers use blockingQueue.put(), which blocks when the queue is full rather than throwing — so the only exception it can raise is InterruptedException (thread interrupted). If the queue fills up under load, put() will stall the caller but the failure metric will stay silent.

If the intent is to detect "queue full, item dropped," offer() is the right call — it returns false immediately when the queue is at capacity, giving you a clean place to record the metric and log a warning. Happy to be corrected if there's a different failure mode you had in mind.
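A JDK-only sketch of the difference, assuming a bounded ArrayBlockingQueue with no consumer draining it: offer() reports the full queue by returning false immediately, which is the natural place to bump the enqueue-failure counter and log, while put() would simply wait. The EnqueueSketch class, its enqueue helper, and the AtomicLong standing in for the webhook_enqueue_failure counter are hypothetical, not the PR's code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical stand-in for a publisher's enqueue path; not the PR's actual code.
class EnqueueSketch {
    private static final Logger LOG = Logger.getLogger(EnqueueSketch.class.getName());

    // Stands in for the webhook_enqueue_failure counter.
    static final AtomicLong enqueueFailures = new AtomicLong();

    static boolean enqueue(BlockingQueue<String> queue, String notification) {
        // offer() never blocks: it returns false at once when the queue is at capacity.
        boolean accepted = queue.offer(notification);
        if (!accepted) {
            enqueueFailures.incrementAndGet();
            LOG.warning("Webhook queue full, dropping notification " + notification);
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1); // capacity 1, no consumer
        System.out.println(enqueue(queue, "wf-1")); // true: there was room
        System.out.println(enqueue(queue, "wf-2")); // false: full, counted and logged
        System.out.println(enqueueFailures.get());  // 1

        // put() would block here forever because nothing drains the queue;
        // a timed offer() makes that wait visible and then gives up.
        System.out.println(queue.offer("wf-3", 50, TimeUnit.MILLISECONDS)); // false
    }
}
```

The design point is that the rejection is observable at the call site, so the metric and the warning can be recorded right where the item is dropped.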

nthmost-orkes

One more observation, non-blocking but worth doing right: the webhook_queue_depth gauge records a snapshot at enqueue time via AtomicDouble.set(). If the publisher thread drains items between enqueues, the gauge can read much higher than the actual depth at scrape time — you'd see a sawtooth that never fully reflects the live queue state.

A pull-style registration in the constructor gives you a live reading at every Prometheus scrape instead:

```java
Gauge.builder("webhook_queue_depth", blockingQueue, Collection::size)
     .tag("notificationType", NOTIFICATION_TYPE)
     .register(Metrics.globalRegistry);
```

Then no call is needed at enqueue time at all. The existing event_queue_depth gauge in Monitors has the same limitation, so this PR is at least consistent, but since you're adding new infrastructure here, it's a good opportunity to set a better pattern.
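To illustrate the sawtooth, here is a JDK-only sketch that mimics the two gauge styles without pulling in Micrometer. SnapshotGauge and PullGauge are hypothetical stand-ins: the first only changes when set() is called, the second re-reads queue.size() every time it is read, roughly what a Gauge.builder registration does on each scrape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.DoubleSupplier;

// Illustrative JDK stand-ins for the two gauge styles; this is not Micrometer.
class GaugeStylesSketch {

    // Push style: reports whatever value was last written (like AtomicDouble.set()).
    static final class SnapshotGauge {
        private volatile double value;
        void set(double v) { value = v; }
        double read() { return value; }
    }

    // Pull style: recomputes the value from the live object on every read,
    // which is what a Gauge.builder(name, obj, fn) registration does at scrape time.
    static final class PullGauge {
        private final DoubleSupplier source;
        PullGauge(DoubleSupplier source) { this.source = source; }
        double read() { return source.getAsDouble(); }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        SnapshotGauge snapshot = new SnapshotGauge();
        PullGauge pull = new PullGauge(queue::size);

        // Three enqueues, each followed by a snapshot update.
        for (String id : List.of("a", "b", "c")) {
            queue.offer(id);
            snapshot.set(queue.size());
        }

        // The publisher thread drains the queue before the next scrape.
        List<String> drained = new ArrayList<>();
        queue.drainTo(drained);

        // "Scrape": the snapshot is stale, the pull gauge reflects the live queue.
        System.out.println(snapshot.read()); // 3.0
        System.out.println(pull.read());     // 0.0
    }
}
```

After the drain, the snapshot still reports the last enqueue-time value while the pull-style reading matches the real queue.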

adonthi-fws · 2026-06-15T10:54:21Z

Hi @nthmost-orkes
Thanks for the review!

- Name tag: queue depth is one shared queue per publisher (TASK vs WORKFLOW), so a per-name depth tag doesn't apply. I kept name on the success/failure/enqueue counters for per-workflow/task drill-down; queue depth is tagged by notificationType only, and I added Javadoc to explain.
- Enqueue failure: switched put() → offer() so a full queue records webhook_enqueue_failure and logs a warning instead of blocking silently.
- Queue depth gauge: replaced the snapshot set() with a pull-style Gauge.builder(..., queue, BlockingQueue::size) registered in the constructor.

nthmost-orkes · 2026-06-22T23:22:48Z

Thanks for the update -- can you increase the test coverage on this feature? Then we should be good to go.

The three counters (webhook_publish_success, webhook_publish_failure, webhook_enqueue_failure) don't have tests yet; only the gauge does.

Counter tests are cheap to add in the same way as the gauge test: add a SimpleMeterRegistry probe, call the method, and assert that the tagged counter incremented (including the defaultIfBlank → "unknown" fallback). Also, TaskStatusPublisher has no test file at all today, so all of its new wiring is uncovered.
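A rough shape for those counter tests, written JDK-only so it runs without Micrometer on the classpath. The map of LongAdder counters stands in for a SimpleMeterRegistry, and the record* method names, the "WORKFLOW"/"TASK" tag values, and the key layout are all assumptions rather than the PR's actual signatures. In a real test the lookup would presumably go through the registry instead, e.g. registry.get("webhook_publish_success").tags(...).counter().count().

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for Monitors plus a meter registry, to show the test shape only.
class WebhookCounterSketch {

    // One counter per (metric, notificationType, name), like a tagged Micrometer counter.
    static final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Mirrors StringUtils.defaultIfBlank(name, "unknown") from Commons Lang.
    static String tagValue(String name) {
        return (name == null || name.isBlank()) ? "unknown" : name;
    }

    private static String key(String metric, String notificationType, String name) {
        return metric + "|" + notificationType + "|" + name;
    }

    static void increment(String metric, String notificationType, String name) {
        counters.computeIfAbsent(key(metric, notificationType, tagValue(name)), k -> new LongAdder())
                .increment();
    }

    static long count(String metric, String notificationType, String name) {
        LongAdder adder = counters.get(key(metric, notificationType, name));
        return adder == null ? 0 : adder.sum();
    }

    // Hypothetical helper names; the PR's real Monitors methods may be named differently.
    static void recordWebhookPublishSuccess(String type, String name) {
        increment("webhook_publish_success", type, name);
    }

    static void recordWebhookPublishFailure(String type, String name) {
        increment("webhook_publish_failure", type, name);
    }

    static void recordWebhookEnqueueFailure(String type, String name) {
        increment("webhook_enqueue_failure", type, name);
    }

    public static void main(String[] args) {
        recordWebhookPublishSuccess("WORKFLOW", "order_flow");
        recordWebhookPublishFailure("TASK", "   "); // blank name is tagged "unknown"
        recordWebhookEnqueueFailure("TASK", null);  // null name is tagged "unknown" too

        System.out.println(count("webhook_publish_success", "WORKFLOW", "order_flow")); // 1
        System.out.println(count("webhook_publish_failure", "TASK", "unknown"));         // 1
        System.out.println(count("webhook_enqueue_failure", "TASK", "unknown"));         // 1
    }
}
```

Each assertion targets one fully tagged series, so a test like this would also catch a counter that increments under the wrong notificationType or skips the fallback.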

- prometheus metrics for http webhook publisheres (2f3e075)
- adonthi-fws mentioned this pull request Jun 3, 2026 in [FEATURE] Add Prometheus Metrics for Workflow and Task Webhook Publishers #1073 (open)
- Merge branch 'main' into issue-1073 (2964897)
- nthmost-orkes reviewed Jun 12, 2026
- code review requested changes (91d787e)
- nthmost-orkes added 7 commits June 15, 2026 13:41:
  - Merge branch 'main' into issue-1073 (9737f87)
  - Merge branch 'main' into issue-1073 (de0237e)
  - style: apply spotless formatting to TaskStatusPublisher (704b2eb)
  - Merge branch 'main' into issue-1073 (b5f2274)
  - Merge branch 'main' into issue-1073 (531e3b9)
  - Merge branch 'main' into issue-1073 (452aa3b)
  - Merge branch 'main' into issue-1073 (89bbb4a)
- nthmost-orkes changed the title ~~prometheus metrics for http webhook publisheres~~ prometheus metrics for http webhook publishers Jun 22, 2026


prometheus metrics for http webhook publishers #1149
adonthi-fws wants to merge 10 commits into conductor-oss:main from adonthi-fws:issue-1073

2 participants
